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Abstract 

In the evaluation of quantum annealers, metrics based on ground state success rates have 
two major drawbacks. First, evaluation requires computation time for both quantum and 
classical processors that grows exponentially with problem size. This makes evaluation itself 
computationally prohibitive. Second, results are heavily dependent on the effects of analog 
noise on the quantum processors, which is an engineering issue that complicates the study of 
the underlying quantum annealing algorithm. We introduce a novel “time-to-target” metric 
which avoids these two issues by challenging software solvers to match the results obtained by a 
quantum annealer in a short amount of time. We evaluate D-Wave’s latest quantum annealer, 
the D-Wave 2X system, on an array of problem classes and find that it performs well on several 
input classes relative to state of the art software solvers running single-threaded on a CPU. 


1 Introduction 

The commercial availability of D-Wave’s quantum annealer^] in recent years 11121 El has led to 
many interesting challenges in benchmarking, including the identification of suitable metrics for 
comparing performance against classical solution methods. Several types of difficulties arise. First, 
a D-Wave computation is both quantum and analog, with nothing resembling a discrete instruction 
or basic operation that can be counted, as in classical benchmarking scenarios; with no common 
denominator for comparison we resort to runtime measurements, which are notoriously transient 
and unstable. A second set of issues arises from the fact that we compare an algorithm implemented 
in hardware to algorithms implemented in software; standard guidelines for benchmarking computer 
platforms, software, and algorithms dUEUSUZ]) do not consider this mixed scenario. 

Another difficulty, addressed in this paper, arises from the fact that the algorithms of interest 
are heuristics for an NP-hard optimization problem. Unlike decision problems, where the algorithm 
returns a solution that can be verified right or wrong, optimization problems allow heuristic solu¬ 
tions that may be better or worse, and that cannot be efficiently verified as optimal (unless P=NP). 
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The performance of a given heuristic on a given input is not described by a single number (time to 
find the solution), but rather by a curve that describes the trade-off between computation time and 
solution quality. Assuming that neither dominates the whole problem space, how do we decide if 
one curve is better than another? Barr et al. [I], Bartz-Beielstein and Preuss 0§2], and Johnson 
f7j discuss several ways to define performance metrics in this context. Parekh et al. (8] consider this 
question in the context of quantum annealing when discussing the merits of two possible success 
criteria: producing an optimal solution for 99% of instances, or always producing a solution within 
99% of optimal. 

Various performance metrics have appeared in the literature on empirical evaluation of D-Wave 
platforms compared to classical software solvers. The first significant effort at evaluation was the 
study by McGeoch and Wang [9] , in which D-Wave’s Vesuvius 5 and Vesuvius 6 generations of chips 
were compared against general purpose software solvers using two general input classes and one 
class that is native to the D-Wave chip structure (known as a Chimera-structured Ising problem). 
That paper used performance metrics based on best solutions found within fixed time limits. 

Subsequently several papers appeared that focus on time to find ground states (optimal so¬ 
lutions). Boixo et al. [3] QQ3], in comparisons to classical algorithms that also return samples of 
solutions (simulated annealing and simulated quantum annealing), introduced metrics that count 
samples-to-solution (STS) and measure time-to-solution (TTS), respectively the number of samples 
and the total time required by a solver to find a ground state with sufficiently high probability; 
these metrics have emerged as standards used in many subsequent papers (e.g. mm)- Rpnnow 
et al. fl3] describe methods for evaluating “quantum speedup” that involve comparisons of scaling 
curves corresponding to TTS over a range of problem sizes. 

Selby p3j uses a somewhat different approach based on the average time to find a “presumed 
ground state” several times in succession. Selby’s implementation m of the Hamze-de Freitas- 
Selby algorithm mm may be used for finding the presumed ground state energies of D-Wave’s 
native Chimera structured Ising Hamiltonians. When applied to one problem class in our test 
set (RAN7, see Section [4.1[ ) this algorithm takes on the order of 0.1 seconds for the graph of the 
D-Wave One system (2011), 15 seconds for the graph of the D-Wave Two system (2013), and 17 
minutes for the graph of the D-Wave 2X system (2015) [13] . and still does not guarantee optimality. 
Running CPLEX (a state-of-the-art commercial optimization solver m) to certify optimality takes 
several hours or days per input at the largest problem sizes. 

This illustrates a fundamental difficulty with metrics based on finding ground states: on suffi¬ 
ciently challenging inputs the time required for any solver grows exponentially with problem siz^] 
Thus the computation of STS and TTS for large inputs is intractable. 

Some approaches for getting around this difficulty in the context of evaluation are known. 
For example Hen et al. m (see also King m) shows how to construct inputs with “planted 
solutions” so that optimal solutions are known a priori by construction. This is an effective 
approach, but in general it can be difficult to argue that inputs constructed this way are sufficiently 
challenging, or representative of real-world inputs, to be interesting HU2Q]. Another idea is to find 
an easily-computed lower bound on the optimal solution and to use that bound as proxy for optimal 
(measuring, for example, time to get within a certain percentage of the bound), but nothing suitable 
appears to be available for the native problems solved by D-Wave quantum annealers. Saket [21j 
developed a PTAf0 for Chimera Ising models, but this is mostly of theoretical interest—to get 

2 Using known algorithms or assuming the exponential time hypothesis mi- 
2 Polynomial-time approximation scheme. 
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within a fraction e of the ground state energy takes 0(n • 2 32 / e ) time, impractical even to get 
within 10%. 

A second problem with using ground states in performance metrics is the fact that analog 
noise in D-Wave’s processors perturbs the input Hamiltonian. This means that the nominal input 
specified by the user differs slightly from the physical input solved by the processor [12, 19| f22l 
[231 EH- This problem is particularly pernicious for finding ground states since ground states and 
excited states in the nominal Hamiltonian may exchange roles in the physical Hamiltonian, and 
generally near-ground states far outnumber ground states. 

In this study we introduce a new metric that avoids the problem of prohibitive runtimes and 
makes evaluation far less sensitive to analog noise. We do this by having the solvers race to a 
target energy determined by the D-Wave processor’s energy distribution. We call this the “tirne- 
to-target” (TTT) metric. Our use of the D-Wave processor as a reference solver in computing the 
TTT metric allows us to circumvent the difficulties of evaluating performance in finding ground 
states, and to explore an interesting property that we have observed: very fast convergence to 
near-optimal solutions. 

The TTT metric identifies low-cost target solutions found by the D-Wave processor within very 
short time limits (from 15ms to 352ms in this study), and then asks how much time competing 
software solvers need to find solution energies of matching or better quality. We observe that the 
D-Wave processor performs well both in terms of total computation time (including I/O costs to 
move data on and off the chip), and pure computation times (omitting I/O costs). Our results 
may be summarized as follows. 

• Considering total time from start to finish (including I/O costs), D-Wave 2X TTT times are 
2x to 15x faster than the best competing software (at largest problem sizes), for all but one 
input class that we tested, in which a solver specific to that input class is faster. 

• Considering pure computation time (omitting I/O costs), D-Wave 2X TTT times are 8x to 
600x faster than competing software times on all input classes we tested. 

The next section presents an overview of D-Wave quantum annealing technology and describes 
the performance models for D-Wave quantum annealers and competing software solvers. Section 
[3] defines the TTT metric and a related STT (samples to target) metric. Section [4] describes 
our experimental setup, Section [5] presents experimental results, and Section [6] summarizes our 
observations and conclusions. Futher descriptions of the D-Wave 2X, discussion of our experimental 
procedures, and results for the full suite of test problems, may be found in the appendices. 

A note on quantum speedup Rcnniow et al. m searched for “limited quantum speedup” by 
looking for differentiation in scaling trends between the performance of a D-Wave Vesuvius chip 
and the performance of a classical analog—a theoretical thermal annealer with free parallelism 
(i.e., simulated annealing with theoretical constant-time sweeps)—in terms of TTS as a function of 
input size. We take special care here to emphasize that in this paper we do not address the issue 
of quantum speedup; rather our goal is to compare runtime performance strictly within the range 
of problem sizes tested. 
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2 Overview 


We start with an overview of D-Wave design features and introduce notation that will be used 
throughout. For details about underlying technologies see Bunyk et al. |25j . Dickson et al. [26], 
Harris et al. [I], Johnson et al. [2j or Lanting et al. m- 

Ising Minimization D-Wave quantum annealing processors are designed to find minimum-cost 
solutions to the Ising Minimization (IM) problem, defined on a graph G = (V, E ) as follows. Given 
a collection of weights h = {hi : i E V} and J = {Jij : (i,j) E E}, assign values from {—1, +1} to 
n spin variables s = {s*} so as to minimize the energy function 

E(s) — ^ ] hjSj ^ ) JijSiSj. (1) 

i&V (i,j)eE 

The spin variables s can be interpreted as magnetic poles in a physical particle system; in this 
context, negative Jij is ferromagnetic and positive Jij is antiferromagnetic , the optimal solution is 
called a ground state , and non-optimal solutions are excited states. IM instances can be trivially 
transformed to Quadratic Unconstrained Boolean Optimization (QUBO) instances defined on in¬ 
tegers s = {0,1}, or to Maximum Weighted 2-Satisfiability (MAX W2SAT) instances defined on 
booleans s= {true, false}, all of which are NP-hard. 

Chimera topology The native connectivity topology for D-Wave 2X platforms is based on a 
C \2 Chimera graph containing 1152 vertices (qubits) and 3360 edges (couplers). 

A Chimera graph of size C s is an s x s grid of chimera cells, each containing a complete bipartite 
graph on 8 vertices (a A' 4 . 4 ). Each vertex is connected to its four neighbors inside the cell as well 
as two neighbors (north/south or east/west) outside the cell: therefore every vertex has degree 6 
excluding boundary vertices. 

In this study, as in others, we vary the problem size using square subgraphs of the full graph, 
from size C 4 (128 vertices) up to C 12 (1152 vertices). Note that the number of problem variables 
n = 8 s 2 grows quadratically with Chimera size. The reason we measure algorithm performance as 
a function of the Chimera size and not the number of qubits is that problem difficulty tends to 
scale exponentially with the Chimera size, i.e., with the square root of the number of qubits, since 
the treewidth of a Chimera graph C s is linear in s |3J. 

Because the chip fabrication process leaves some small number (typically fewer than 5%) of 
qubits unusable, each processor has a specific hardware working graph H C C 12 • The working 
graph used in this study has 1097 working qubits out of 1152 (see Appendix [A] for an image). 

Quantum annealing. D-Wave processors solve Ising problems by quantum annealing (QA), 
which belongs to the adiabatic quantum computing (AQC) paradigm. The QA algorithm is im¬ 
plemented in hardware using a framework of analog control devices to manipulate a collection of 
qubit states according to a time-dependent Hamiltonian shown below. 

77(f) = A(t) ■ HinU +m • 

'Hprob • ( 2 ) 

This algorithm carries out a gradual transition in time t : 0 —>• t a , from an initial ground state in 
; Hinit , to a state described by the problem Hamiltonian JL pr ob = X]?: Jijcrfa |. The problem 
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Hamiltonian matches the energy function (JTJ) , so that a ground state for R pro b is a minimum-cost 
solution to E(s). The AQC model of computation was proposed by Farhi et al. |28| who showed 
that if the transition is carried out slowly enough the algorithm will find a ground state (i.e. an 
optimal solution) with high probability. 

The D-Wave processor studied here has a chip containing 1097 active qubits (quantum bits) 
and 3045 active couplers made of microscopic loops of niobium connected to a large and complex 
analog control system via an arrangement of Josephson Junctions. When cooled to temperatures 
below 9.3K, niobium becomes a superconductor and is capable of displaying quantum properties 
including superposition, entanglement, and quantum tunneling. Because of these properties the 
qubits on the chip behave as a quantum mechanical particle process that carries out a transition 
from initial state described by Rinit to a problem state described by Rprob [TUI I2U1 27]. 

Theoretical guarantees about solution times for quantum algorithms (found in [28]) assume that 
the computation takes place in an ideal closed system, perfectly isolated from energy interference 
from ambient surroundings. The D-Wave 2X chip is housed in a highly shielded chamber and 
cooled to below 15mK; nevertheless as is the case with any real-world quantum device, it must 
suffer some amount of interference, which has the general effect of reducing the probability of 
landing in a ground state. Thus theoretical guarantees on performance may not apply to these 
systems. We consider any D-Wave processor to be a heuristic solver, which requires empirical 
approaches to performance analysis. 

Modeling performance. Given input instance ( h,J ), a D-Wave computation involves the fol¬ 
lowing steps. 

1. Program. Load (h, J ) onto the chip; denote the elapsed programming/initialization time tj. 

2. Anneal. Carry out the QA algorithm. Anneal time t a can be set by user to some value 
20ps < t a < 20, OOOps. 

3. Read. Record qubit states to obtain a solution; denote the elapsed readout time t r . 

4. Repeat. Iterate steps 2 through 4, R times, to obtain a sample of R solutions. 

We define sample time R t s and total time T as follows: 

t s = ( t a + t r ) (3) 

T = ti + Rt s . 

Performance metrics. As is common with optimization heuristics whether quantum or classical, 
performance is characterized by a trade-off between computation time and solution quality: more 
time yields better results. 

At least two such trade-offs can be identified in the D-Wave performance model: increasing 
anneal time t a and/or increasing the sample size R improves the probability of observing low-cost 
solutions in a sample. In current technologies, growing t a above its 20ps lower bound appears to 
have very little effect on optimization performance (on the inputs classes we’ve considered); in this 
small study we fix t a = 20ps and focus the question of what value of R is needed to achieve a 
target solution quality. 
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Figure 1: Inverse CDF (cumulative distribution 
function) of the sample energy distribution of the 
D-Wave hardware. Vertical lines indicate quantiles 
used to define target energies. 


Figure 2: Expected D-Wave STT values for differ¬ 
ent target energies. Red line segments indicate the 
disparity between 1/q and expected STT. 


The STT metric defined in the next section provides an expected value for R based on analysis 
of the distribution of solution samples returned by the D-Wave 2X processor chip. The TTT metric 
calculates expected time (anneal time or total time) to reach a target solution quality that depends 
on the expectation STT. 

3 The time-to-target (TTT) metric 

TTT is defined with respect to a parameter q, a specific reference solver (in this case the D-Wave 
2X), and a specific input as follows. First, we gather many samples from the reference solver. 
Then, based on the energy distribution of these samples, we identify the target energy E q that is 
found at quantile q of the sample distribution. For example, in this study we take 50,000 samples 
from the reference solver for each input; for q = 0.001, we would sort the 50,000 sample energies 
from lowest to highest and take the 50th as the target energy. 

We want this energy to be on the lower tail of the energy distribution, so that it is non-trivial 
to reach, but not so low that it is an extreme outlier. 

For each input, target energy E q , and solver, we define the samples-to-target (STT) as the 
expected number of samples required by the solver to attain an energy at or below the target 
energy. 

Example Figure [I] shows the quantiles (inverse cumulative distribution function) of the sample 
energy distribution of 50,000 reads from the DW2X on a single input. We define three target 
energies for quantiles q = 0.001,0.01,0.1, respectively, corresponding to the best 0.1 percent, 1 
percent, and 10 percent of energies in the sample. 

It is natural to intuit that, for a given quantile q, STT would simply be 1/q. However, 1/q 
only serves as an upper bound on STT: due to quantization of energy levels the actual STT can 
be much lower than this bound. For example in Figure [l] the energy -1890 found at q = 0.001 
(suggesting STT = 1000) is also found at q = 0.002 and higher quantiles, which means that STT 
< 500 for this input. 
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In Figure [2] we plot what is essentially the inverse of Figure [l] The expected number of samples 
required by the D-Wave hardware to hit a given target energy is the reciprocal of the probability 
mass function at the target energy. In this example, for quantiles 0.001, 0.01, and 0.1, the STT 
values are 413.2, 25.58, and 7.029, lower than their upper bounds of 1000, 100, and 10. 

TTT is the expected time required by the solver to reach the target energy, found by combining 
STT with computation times. We consider two versions of this metric: total time (TTT to t a i) and 
annealing only (TTT annea i). The two versions serve different purposes in our analyses. TTT tota i 
measures the entire computation start to finish, including chip programming and sample times. 
This a more realistic metric that corresponds to the wall clock time a user would experience. 
TTT annea i, on the other hand, includes only the anneal time needed to solve the problem—this 
more precisely captures the fundamental performance properties of the algorithms considered here, 
and allows a better view of scaling performance. Note that I/O times dominate computation costs 
in current processor models; this overhead cost has been significantly reduced compared to past 
generations of the processor and is expected to continue to decrease with future releases. 

Timing is broken down into programming/initialization time, anneal time , and readout time, 
respectively ti, t a , and t r . For the DW2X these are defined in the previous section. The software 
solvers used in our comparison study are also designed to return a number of samples for each 
input; for these, ti corresponds to initialization/constructor time, t a corresponds to the bare-bones 
time to generate one sample, and t r is considered negligible and recorded as 0 (see Appendix [B| for 
details). Let pt be the probability that a single sample will reach the target energy, as estimated by 
the sample statistic pt, which is equal to 1/STT. The use of gauge transformations for the DW2X 
requires us to also consider p g , the probability that a single gauge transformation will contain 
a successful sample, as estimated by the sample statistic p g (see [3] for a description of gauge 
transformations). Each gauge transformation requires an additional programming step, adding 
time ti to total computation time. For software solvers we define p g = 1. With these in hand, we 
have: 


STT = l/p t 


TTT annea i = t a /p t 


TTT total = ^ + 
Pg 


t a T t r 
Pt 


See Appendix [B] for details of our time measurement procedures. 

Thus, for given q the TTT metric identifies as a target the lowest solution energy the D-Wave 
2X processor can expect to find within its first STT < 1/q samples. In this study we use quite 
small quantiles q = 0.001,0.01,0.1 which correspond to sample size bounds of 1000,100,10: the 
D-Wave 2x Ends these samples using total computation times of 352ms, 46ms, and 15ms at most. 

The TTT comparison essentially asks how much time competing software solvers need to find 
solution energies of quality that matches the target energies found by D-Wave 2X processor within 
very short time limits (from 15ms to 352ms). 
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4 Experimental setup 


The goal of this study is to illustrate some properties of the TTT metric by evaluating the per¬ 
formance of the latest generation D-Wave processor on Chimera-structured Ising problems. This 
section outlines our experimental procedure. 

4.1 Problems 

The problem classes we consider here fall into three categories: 

RANr In RANr problems, for integer r, each h value is 0 and each J value is chosen uniformly 
from the integers in the range [— r, r], excluding 0. We consider classes RANI, RAN3, RAN7, and 
RAN127. 

ACfc-odd The AC k problem class can be generated by taking the RANI problem class and 
multiplying each inter-tile coupling by k. These problems appear to be more challenging than 
simple random problems by removing a type of “local optimality” that is otherwise created within 
each tile. We consider AC3 problems, in which each in-tile coupling is chosen uniformly from ±1 
and each inter-tile coupling is chosen uniformly from ±3. The AC3-odd problem class additionally 
adds fields to reduce degeneracy—for each spin whose incident couplings have an even sum, we 
add a field chosen uniformly from ± 1 . 

FLr The class of frustrated loop problems with bounded precision r defined by King [T9] as 
a modification of the original idea of Hen et al. m- These are constraint satisfaction problems 
generated for a specific input graph wherein the Hamiltonian range is limited to r. We consider FL3 
problems generated with a constraint-to-qubit ratio of 0.25, the hardest regime for these problems. 

For each problem class tested we generated 100 inputs per problem size over a range of problem 
sizes from C 4 (128 qubits) up to C 12 (1097 qubits). 

4.2 Solvers 

The reference solver, and our primary interest in this study, is a D-Wave 2X quantum annealing 
processor currently online at D-Wave headquarters. For comparison we use representatives from 
the two fastest known families of classical software solvers for Chimera-structured Ising problems: 
simulated annealing (SA) and the Hamze-de Freitas-Selby algorithm (HFS). 

Each solution from each solver is post-processed by applying a simple greedy descent procedure 
(greedily flipping spin values) to move the solution to a local minimum. This removes some noise 
from the metrics and for these problem classes can be done very quickly. It also makes our 
parameterization of simulated annealing more robust, reducing the impact of final temperatures 
that are too warm. 

Throughout, for each input, we use software times corresponding to the fastest (best-tuned) 
version available from each software family. This essentially assumes the existence of a hypothetical 
clairvoyant portfolio solver that instantly chooses the best parameter settings for the input and 
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knows exactly how long to run the solver. We have not similarly tuned the D-Wave processor for 
best performance; instead we use default parameter settings for all problems. 

We note that this portfolio approach is somewhat contrary to accepted benchmarking guidelines 
(010), which recommend against too-aggressive tuning and post hoc selection of solution methods. 
Because of this choice the software times reported here correspond to an optimistic usage scenario 
that is unlikely to arise in practice. However, overestimating DW2X times using untuned perfor¬ 
mance data, and underestimating software times using unrealistically tuned data, allows us to find 
a lower bound on the size of the performance gap that would be observed by a user. See Appendix 
[B]for more discussion of time measurement and Appendix |C] for our procedures for tuning SA. 

D-Wave 2X system The DW2X is the latest generation of D-Wave quantum annealer, launched 
in summer 2015. We use a fixed annealing time of 20ps and fixed rethermalization time of lOOps 
throughout. Post-processing (greedy descent) is applied in all cases. Our calculation of TTT to tai 
assumes that gauge transformations might be applied and includes the extra programming cost 
when appropriate. 

Simulated annealing Simulated annealing [22] is a canonical optimization algorithm that sim¬ 
ulates thermal annealing, the classical analog to quantum annealing. We use an implementation 
of simulated annealing developed in-house. This version of simulated annealing is optimized for 
Chimera architectures and accepts Hamiltonians stored as single-precision floating point numbers. 

Following Rprinow et al. m we carried out extensive pilot tests to find optimal parameter 
settings for (in-house) SA, and report the lowest computation time observed for each input and 
metric. See Appendix 0 for details. While our implementation of SA should be as good as 
any other in terms of solution quality, it is not the fastest. Isakov et al. [30] developed highly 
optimized simulated annealing codes—one particularly effective technique they use is the recycling 
of randomly generated numbers; this leads to a significant speedup at the cost of introducing a 
small amount of correlation between the samples. Two of their specific SA codes are of interest 
in this study: an_ms_rl_nf (for RANI problems) and an_ss_ge_fi (for other problem classes). 
The an_ms_rluif solver only works for one class of problems—RANI problems—but achieves a 
huge speed increase by exploiting word-level parallelism to run 64 replicas at a time using bitwise 
arithmetic. 

We estimate TTT metrics for these two solvers by using the STT results of our in-house SA, 
then converting these STT results to TTT results using the single-threaded timings reported by 
Isakov et al. m, which were also measured on an Intel Xeon E5-2670. For all TTT timings reported 
for SA in this study, we report the TTT value for whichever time is best for the specific input: 
the measured time of our in-house SA, the estimated time of an_ss_ge_fi, and, if applicable, the 
estimated time of an_ms_rl_nf. Again, this is with STT determined using the optimal number of 
sweeps and better cooling schedule, selected a posteriori on a per-input basis. 

We note that, according to accepted algorithm benchmarking guidelines 010, a “fair test” 
requires that solvers used in a given study should be matched by scope; in particular every solver 
should be able to read and solve every input. Therefore an_ms_rl_nf should arguably be disqualified 
from this study because it can only read one of the 6 input classes used. Nevertheless we include 
it in the RANI tests to identify the extreme boundaries of single-threaded software capabilities. 
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HFS algorithm We use a slightly modified version of Selby’s implementation [15] of the HFS 
algorithm ns m- This algorithm performs repeated optimizations on low-treewidth induced 
subgraphs of the input graph. It performs a local neighborhood search like SA, but with a much 
larger neighborhood, and thus slower but more effective updates. Selby’s solver can run in different 
modes depending on the treewidth of the subgraphs to optimize over. For this study we tried the 
treewidth 4 solver (GS-TW1) and the treewidth 8 solver (GS-TW2). While GS-TW2 shows better 
performance in finding ground states for large problems [2], GS-TW1 showed better time-to-target 
performance in every case in this study; for this reason we only report results and timings from 
GS-TW1. 

In its original form, Selby’s implementation searches repeatedly for minima and discards any 
local optima that turn out to not be globally optimal; his timing procedure also ignores some 
initial “burn in” time that is spent finding non-optimal solutions. We use the solver in what we 
call “sampling mode”, which returns all local minima found as independent samples and records 
time-per-sample for each, which corresponds to anneal times for SA and DW2X. Since we do not 
discard local minima, this version of the solver is faster with respect to the TTT metric. 

4.3 Metrics 

To define target energies for each input we drew 50,000 samples from the D-Wave 2X system, 
taking 1,000 samples from each of 50 random gauge transformations. Each gauge transformation 
slightly perturbs the nominal Hamiltonian: this reduces bias and creates a more accurate reference 
distribution from which sample quantiles may be drawn. We find target energies for quantiles 
q = 0.001, 0.01, and 0.1, which give upper bounds on STT of 1000, 100 and 10 hardware samples, 
respectively. For the DW2X, these sample counts correspond respectively to total computation 
times of 352ms, 46ms, and 15ms, and to anneal times of 20ms, 2ms, and 0.2ms. 

To calculate STT for each input, we used the same 50,000 samples from the DW2X. For HFS 
and each parameterized version of SA we calculated STT based on 1,000 samples for each input; 
the use of smaller sample sizes creates coarser increments of sample counts STT in these cases, 
but this step was necessary given the large computation times required by software to return 1000 
samples; see Appendix A for details. 

5 Results 

Our results for the full suite of input classes appear in Appendix [Dj Here we show selected examples 
that highlight the range of relative performance advantages of the D-Wave 2X processor. 

Figure |3] shows median STT performance (over 100 inputs) and 95% confidence interval^] for 
RANI problems. Left to right, the three panels correspond to q = 0.001 (harder targets), q = 0.01, 
and q = 0.1 (easier targets). The x-axis shows increasing Chimera sizes C 4 to C 12 ; recall that the 
number of qubits (variables) in a C s problem is approximately 8 s 2 . The y-axis (log scale) shows 
expected STT. For DW2X, the blue curves correspond to sample counts bounded above by 1000, 
100, and 10, for increasing quantiles q: here we see STS values of approximately 300, 40, and 
8 , respectively at C 12 size. The confidence intervals are small, corresponding to a narrow range 
of observations over all inputs. By definition, the set of possible STT values for the DW2X are 

4 Throughout this paper we report confidence intervals for medians derived analytically from the sample quantiles 
(see, e.g., m ). For 100 inputs the 95% Cl for the median is bounded by the 40th and 60th percentiles. 
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-J- HFS 
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Figure 3: Performance of solvers in STT for RANI problems. Shown are the medians (over 100 input in¬ 
stances) for each solver and problem size, with error bars showing 95% confidence intervals. SA performance 
corresponds to SA10000 (10000 sweeps per anneal) which is optimal in this metric. 
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Figure 4: Performance of solvers in TTT annea i for RANI problems. Shown are medians for each solver and 
size with error bars showing 95% confidence intervals. SA times are estimated for the an_ms_rl_nf solver; 
see Section [6] for details. 


11 



















Solver 

-T- DW2X 

-J- HFS 
—J— SA 


Figure 5: Performance of solvers in TTT to tai for RANI problems. Shown are medians for each solver and 
size with error bars showing 95% confidence intervals. For this input class, SA times are estimated and 
correspond to the an_ms_rl_nf solver described in [3U|; see Section [6l for details. 


bounded above by 1000, 100, and 10, which are quite close to the median values shown at largest 
problem sizes. We observe similarly tight ranges for the software solvers, and there is no reason to 
expect heavy tails as observed by Steiger et al. [32] in the STS metric. 

The fact that STT for the reference solver (DW2X) tends to grow with problem size rather 
than staying constant appears to be an artifact of the discretized nature of the solution space 
and the fact that smaller problems are easier to solve. For example, for q = 0.001 at C 12 size, the 
proportion of samples that hit the target is 0.003; at Cs the proportion is a much higher 0.025—the 
smaller problems have fewer distinct energies to choose from as well as a higher relative number of 
good-quality solutions in the sample. STT for the reference solver approaches the upper bound of 
1/q much more quickly for high-precision inputs than for low-precision inputs. 

Among SA solvers we tested, SA 10000 (using the maximum number of sweeps per sample that 
we tested) is always optimal for STT because this metric essentially assumes constant cost per 
sample: since more sweeps are free it is always best to maximize the number of sweeps per sample. 
The comparatively flat scaling for SA in this figure indicates that 10000 sweeps is more than ample 
time to solve these problems: at q = 0.1 we see STT < 2 throughout, indicating that 50 percent of 
the samples hit this target. 

It is interesting that the DW2X and SA curves have approximately the same shape in these three 
charts, suggesting that the distributions of solutions returned by these solvers are quite similar, 
at least for quantiles q = 0.1 and smaller. The HFS algorithm shows relatively different behavior 
indicating a different distribution: worse than DW2X and SA at hitting targets in the smallest 
quantile (harder problems) but better at hitting targets in the biggest quantile (easier problems). 

Although SA 10000 is best in STT, the large time-per-sample it requires (approximately 0.16ps 
per sweep or 1.6ms per sample at C 12 size), means that it is not necessarily the best choice for the 
TTT metric. Figure [4] shows TTT annea i for these problems, calculated by multiplying STT by the 
time-per-anneal for each solver. We have the following observations: 

• For DW2X, time per anneal is a constant 20ps. Therefore scaling of TTT annea i is identical 
to that for STT: for example both STT and TTT annea i increase through about 2.5 orders of 
magnitude at q = 0.001, and increase about 3-fold at q = 0.1. 
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• For SA, time per anneal is the product of time per sweep and the number of sweeps per anneal. 
Time per sweep is proportional to the number of problem variables and grows quadratically in 
C s . In this and subsequent figures we display results for the optimal number of sweeps found 
in extensive pilot tests of a variety of solver parameters. For RANI inputs the best version 
of SA corresponds to anuns_rl_nf and the runtimes shown are based on estimates from data 
published elsewhere. See Appendix [B] for details about software times and Appendix [C] for 
our tuning procedures for SA. 

Comparing the green curves in Figures [3] and [4] we see that STT grows by factors near lOx, 
lOx, and 2x in each panel, while TTT grows by about lOOOx, lOOOx, and lOOx, respectively, 
as problem size increases about 8-fold from C 4 (about 128 variables) to C 12 (1097) variables. 
Thus, roughly speaking, a lOOOx fold increase in computation time is a combination of an 
8 -fold increase in time-per-sweep and a 125-fold increase in sweeps-per-solution (proportional 
to STT for this particular version of SA) through the problem range. 

• For HFS, time per anneal varies depending on how long it takes to hit a local minimum. In 
this figure TTT scaling resembles that of SA. 

• In all cases, at C 12 sizes, TTT annea i is between 8x and about 80x faster than the best com¬ 
peting software solvers we know of. 

• Although the scaling curves shown here give a good idea of the relative amount of work each 
solver performs in terms of computation time, we caution against trying to extend these 
curves to larger problem sizes or other quantiles. For example the shapes of the curves seen 
here are an artifact of how this choice of quantiles “cuts” the distribution solution energies, 
and would be different, for example, if the target were selected by fixing computation times. 
Much more work is needed to understand the tradeoff between computation time and solution 
quality that is offered by each solver; this is an interesting topic for future research. 

Figure [5] shows TTT tota i for RANI problems; this metric includes initialization time and (in the 
case of DW2X) readout time and reflects the actual user experience using these solvers. The flat 
scaling of the blue DW2X curve on the center and right panels, and at small C s values on the left 
panel, reflects the fact that the constant programming time of 11.6ms dominates total computation 
time. Programming time dominates whenever the number of samples is less than 36; in particular 
for q = 0.1 the upper bound of 10 reads (3.4ms) ensures that this curve will always appear flat 
(see the discussion of component times in Section [ 2 ]). Only at largest problems and at the hardest 
quantile (q = 0.001) do we see computation times grow above this threshold. 

In all three quantiles DW2X is faster in total computation time than the HFS algorithm at the 
largest problem sizes. In all three quantiles the SA solver is fastest; estimated an_ms_rl_nf times 
are lowest among all solvers considered, at all problem sizes. Recall that this solver exploits bit par¬ 
allelism to achieve extremely fast computation times at the price of extremely narrow applicability 
to the 1-bit problems tested here. 

Figures 4 through 6 illustrate the strongest relative performance in TTT that we have observed 
for software solvers. The next three figures illustrate the other extreme. In Figures [6] and [7] we show 
TTT ann ea i and TTT tota i, respectively, for FL3 problems. For TTT annea i, at the largest problem 
sizes the DW2X is about 600x faster than any competitor for all quantiles measured. As with 
RANI problems, for the DW2X, the programming time dominates TTT tota i in nearly all cases. 
STT plots for FL3 and other problem classes can be found in Appendix [P| 
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Figure 6: Performance of solvers in TTT annea i for frustrated loop (FL3) problems. Shown are medians 
over 100 random inputs for each solver and size with error bars showing 95% confidence intervals. 



q = 0.1 
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-f- DW2X 

-J- HFS 
—J— SA 


Figure 7: Performance of solvers in TTT tota i for frustrated loop (FL3) problems. Shown are medians over 
100 random inputs for each solver and size with error bars showing 95% confidence intervals. 
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Figure 8: Relative performance of the DW2X vs. other solvers in TTT annea i. Shown are medians for each 
problem class and size with error bars showing 95% confidence intervals. 
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Figure 9: Relative performance of the DW2X vs. other solvers in TTT tota i. Shown are medians for each 
problem class and size with error bars showing 95% confidence intervals. 
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Measuring D-Wave relative performance By dividing the best software solver’s TTT by 
the TTT of the DW2X for each input, we can measure how much faster or slower the D-Wave 
processor is than the strongest competitor on a per-input basis. We plot this relative performance 
for TTT ann ea i and TTT tota ] in Figures [8] and [9] respectively. 

As a function of q, the relative performance of the D-Wave 2X system over the software solvers 
generally peaks at q = 0.1 for TTT annea i and at q = 0.01 for TTT tota i. This range seems to be 
the sweet spot for the DW2X, where it has gathered enough samples to ensure a good solution, 
yet the software solvers have not had enough time to catch up. For TTT totE j, the D-Wave relative 
performance is notably lower for q = 0.001; this is partially due to the possibility of requiring 
multiple gauge transformations to reach the target energy. It is possible that D-Wave’s relative 
performance in this metric could be improved by increasing the number of samples per gauge 
transformation. 

On RANr problems there is a steady degradation of SA as a competitor as precision increases 
(see Appendix 0. even without considering the anuns_rl_nf solver. This may be because as 
precision increases, degeneracy decreases, meaning the problems are harder for all solvers. It may 
also be because the cooling schedules we use for SA are inappropriate for higher-precision instances. 
While we have made a good-faith effort to use SA optimally, in particular by using reasonable 
cooling schedules and optimizing the number of sweeps as suggested by Rpnnow et al. m , it is 
possible that different parameterization could lead to better performance, and we invite others to 
try. In the meantime, for high-precision random instances we are content to let HFS stand as the 
most competitive software solver. 

6 Discussion 

In these TTT metrics, with the exception of the RANI problem class, the single-threaded software 
solvers evaluated here have not kept up with the D-Wave hardware at the full 1097-qubit problem 
scale. While it is possible that a new algorithm could be developed that could beat the DW2X in 
these metrics on a single thread, and we encourage researchers to continue such efforts, we believe 
that the real question has now turned to multithreaded, multi-core software solvers. 

Isakov et al. m evaluated parallelized versions of their simulated annealing code with up to 16 
threads, but this would be insufficient to match the anneal-time-only performance of the DW2X in 
many cases, even with idealized perfect parallelism. For these TTT metrics it remains unclear how 
many CPU cores it would take to match the performance of the DW2X. It bears investigating this 
question using actual timing on multi-CPU platforms, where memory access and communication 
costs are likely to dominate single-core instruction times at least as much as programming and 
readout times dominate the quantum annealer. 

While this study has only used CPU-based software solvers, GPUs are becoming increasingly 
popular as sources of cheap parallelism and are a viable means to fast, cheap Monte Carlo simula¬ 
tion. We are currently investigating GPU-based algorithms to determine how many GPU cores it 
would take to match the DW2X in these TTT metrics. 
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A DW2X hardware working graph 



Figure 10: Hardware graph used in the D-Wave 2X system. The C 12 Chimera architecture consists of a 
12 x 12 grid of Chimera unit tiles, each having 8 qubits. Imperfections in the fabrication process lead to 
inactive qubits; in this chip 1097 of the 1152 qubits are active. 
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B Solver timing 

DW2X timing 

The timing for the DW2X does not depend on the size or class of the problem. 


Programming time 

Anneal time 

Readout time 

11.6ms 

20ps 

320ps 


Table 1: Timing components for the D-Wave 2X system. 


Software timing 

Software solvers were run single-threaded on an Intel Xeon E5-2670 processor. For HFS and our 
in-house SA implementation we measure initialization time and time per sweep as a function of 
problem class and size. To measure initialization time, we timed the initialization method 10 times 
and took the minimum initialization time. To measure time per sweep, we measured the time 
required to perform hundreds of thousands of sweeps and took the mean. Initialization times and 
times per sweep for SA and HFS are shown in Figures [TT||12[[l3l and|14| Note that time per sweep 
for both solvers grows linearly with the number of spins (and thus quadratically with the Chimera 
size), as we would expect. 

For the an_ss_ge_fi and anjms_rl_nf solvers we used initialization times and spin update 
times reported by Isakov et al. [30] (which were also measured on an Intel Xeon E5-2670). For 
the annns_rl_nf solver, samples are requested in batches of 64. For bookkeeping, we define the 
empirical success probability of a batch hitting the target energy as p' t = 1 — (1 — p*) 64 and the 
annealing time of a batch as the time required to anneal 64 samples. A “spin flip” corresponds to the 
cost of updating a single problem variable (a spin) in a single iteration of an inner loop; a “sweep” 
corresponds to a single iteration through all n problem variables. Table [2] shows initialization time 
and time per sweep at the largest C \2 problem size. 


Solver 

Init 

Spin flips per ns 

Time per spin flip 

Time per sweep (1097 spins) 

an_ms_rl_nf 

0.6ms 

6.65 

0.15ns 

0.16ps 

an_ss_ge_f i 

69.0ms 

0.30 

3.33ns 

3.66ps 


Table 2: Timing components for simulated annealing codes of Isakov et al. m- 

For SA the conversion from sweep time to total time is trivial since the number of sweeps is 
fixed. For HFS, on the other hand, the number of “tree sweeps” (i.e., low-treewidth updates) per 
sample is determined internally, as the algorithm simply descends to a local minimum using as 
many sweeps as it takes. In this case we measure the mean number of “tree sweeps” per sample 
returned. 


C Tuning SA 

Our implementation of SA has two main parameters: number of sweeps per anneal, and the choice 
of cooling schedule. To optimize the number of sweeps per anneal, we tried values in the set {10, 
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Figure 11: Initialization time for SA. 


Figure 12: Time per sweep for SA. 
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Figure 13: Initialization time for HFS. 


Figure 14: Time per sweep for HFS. 
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20, 40, 100, 200, 400, 1000, 2000, 4000, 10000} for all inputs. We also tried each solver using two 
cooling schedules: the first uses an ‘unsealed’ schedule so that the inverse temperature (3 increases 
linearly from 0.01 to 3; the second uses a ‘scaled’ schedule that increases from 0.01/r to 3/r, 
where r is the range of the Hamiltonian, i.e., the largest absolute value of any component of the 
Hamiltonian. The unsealed version is too fast for high precision inputs (RAN127 and RAN7) and 
the scaled version was used instead. For each problem class we report only results from the better 
of the two cooling schedules and best choice of sweeps per anneal. 

D Full results 

In this appendix, for each problem class we show plots for STT, TTT annea i, and TTT tota i. Note 
that the STT plots show results for a single SA solver fixed at 10,000 sweeps. This is always the 
best SA solver in terms of STT, though it is only the best SA solver in terms of TTT metrics at 
the largest problem sizes. Each plot shows medians for each solver for each problem size with error 
bars showing 95% confidence intervals. 

The TTT plots in this appendix show the three versions of simulated annealing, which for 
the sake of cleanliness we combine in the main body. For RANI inputs, an_ms_rl_nf is always 
the fastest SA implementation. For other inputs, our estimated an_ss_ge_fi times are lower by a 
constant factor than those of our SA code, except on small inputs where our in-house SA is faster 
by virtue of its lower initialization cost. 
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